Graphics in R

Basic Plots

Setup

Run the Setup.R file.

If everything works correctly, you should see a plot:

ggplot2 In a Nutshell

  • Package for statistical graphics
  • Developed by Hadley Wickham
  • Designed to adhere to good graphical practices
  • Supports a wide variety of plot types
  • Constructs plots using the concept of layers
  • See http://had.co.nz/ggplot2/ or Hadley’s book ggplot2: Elegant Graphics for Data Analysis for reference material

qplot Function

The qplot() function is the basic workhorse of ggplot2

  • Produces all plot types available with ggplot2
  • Allows for plotting options within the function statement
  • Creates an object that can be saved
  • Plot layers can be added to modify plot complexity
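
As a minimal sketch of the last two points, the snippet below saves a qplot() result as an object and then adds a layer to it. It uses the built-in mtcars data rather than a data set from these slides, and the names p and p_smooth are only illustrative. Recent ggplot2 releases mark qplot() as deprecated, so a warning may appear; the plot is still produced.

```r
library(ggplot2)

# qplot() returns a plot object rather than only drawing to the screen
p <- qplot(wt, mpg, data = mtcars, geom = "point")

# Layers are added to the saved object with +
p_smooth <- p + geom_smooth(method = "lm", se = FALSE)

# Printing the object draws it
print(p_smooth)
```

Because the plot is an ordinary object, it can be built up step by step and printed only once it looks right.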

qplot Structure

The qplot() function has a basic syntax:

qplot(variables, plot type, dataset, options)

  • variables: list of variables used for the plot
  • plot type: specified with a geom = statement
  • dataset: specified with a data = statement
  • options: there are so, so many options!

Diamonds Data

Objective: Explore the diamonds data set (preloaded along with ggplot2) using qplot for basic plotting.

The data set was scraped from a diamond exchange company database. It contains the prices and attributes of over 50,000 diamonds.

Examining the Diamonds Data

What does the data look like?

Look at the top few rows of the diamond data frame to find out!

head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
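
Beyond head(), a few base R calls give a quick overview of the size and variable types. This is only a sketch of one possible first look, not part of the original exercise.

```r
library(ggplot2)

dim(diamonds)            # number of rows and columns
str(diamonds)            # type of each variable
summary(diamonds$price)  # five-number summary and mean of price
table(diamonds$cut)      # counts for one of the categorical variables
```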

Basic Scatterplot

Basic scatter plot of diamond price vs. carat weight

qplot(carat, price, geom = "point", data = diamonds)

Another Scatterplot

Scatter plot of diamond price vs. carat weight showing the versatility of options in qplot

qplot(carat, log(price), geom = "point", data = diamonds, 
    alpha = I(0.2), color = color, 
    main = "Log price by carat weight, grouped by color") + 
    xlab("Carat Weight") + ylab("Log Price")
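
The alpha = I(0.2) argument above uses I() to mark a value as a fixed constant rather than a variable to be mapped. The sketch below contrasts the two cases on the built-in mtcars data; the object names p_const and p_mapped are illustrative, and qplot() may emit a deprecation warning in recent ggplot2 releases.

```r
library(ggplot2)

# color mapped to a variable: one hue per level, with a legend
qplot(wt, mpg, data = mtcars, color = factor(cyl))

# color fixed with I(): every point is drawn in the same color, no legend
p_const <- qplot(wt, mpg, data = mtcars, color = I("red"), alpha = I(0.5))
print(p_const)

# Without I(), "red" is treated as a one-level variable, so the points get
# the first color of the default palette and a legend entry labelled "red"
p_mapped <- qplot(wt, mpg, data = mtcars, color = "red")
print(p_mapped)
```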

Your Turn

All of the “Your Turns” for this section will use the tips data set:

tips <- read.csv("https://bit.ly/2gGoiLR")
  1. Use qplot to build a scatterplot of the variables tip and total_bill
  2. Use options within qplot to color points by smoker status
  3. Clean up the axis labels and add a main plot title

Your Turn

Solutions

Scatterplot of tip vs. total bill

qplot(data = tips, x = total_bill, y = tip)

Your Turn

Solutions

Color points by smokers

qplot(data = tips, x = total_bill, y = tip, 
      color = smoker)

Your Turn

Solutions

Pretty axis labels and title

qplot(data = tips, x = total_bill, y = tip, 
      color = smoker,
      xlab = "Total Bill ($)",
      ylab = "Tip ($)", 
      main = "Tip left by patrons' total bill and smoking status")

Plotting Map Data

States Data

To make a map, load up the states data and take a look:

states <- map_data("state")
head(states)
##        long      lat group order  region subregion
## 1 -87.46201 30.38968     1     1 alabama      <NA>
## 2 -87.48493 30.37249     1     2 alabama      <NA>
## 3 -87.52503 30.37249     1     3 alabama      <NA>
## 4 -87.53076 30.33239     1     4 alabama      <NA>
## 5 -87.57087 30.32665     1     5 alabama      <NA>
## 6 -87.58806 30.32665     1     6 alabama      <NA>

Basic Map Data

What data is needed in order to plot a basic map?

  • Latitude/longitude points for all map boundaries
  • The boundary group to which each lat/long point belongs
  • The order to connect points within each group

Basic Map Data

The states data has all necessary information
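
To see why the group column matters, here is a minimal sketch with a made-up data frame of two unit squares. The column names mimic the states data, but toy_map and its values are invented for illustration.

```r
library(ggplot2)

# Two toy "regions" (unit squares); each boundary is closed by repeating
# its first point
toy_map <- data.frame(
  long  = c(0, 1, 1, 0, 0,   2, 3, 3, 2, 2),
  lat   = c(0, 0, 1, 1, 0,   0, 0, 1, 1, 0),
  group = rep(c(1, 2), each = 5),
  order = rep(1:5, times = 2)
)

# Without group, the path runs straight from one square into the next
ggplot(toy_map, aes(x = long, y = lat)) + geom_path()

# With group, each boundary is drawn separately
ggplot(toy_map, aes(x = long, y = lat, group = group)) + geom_path()
```

geom_path() connects points in the order the rows appear, so the order column is what lets you restore a correct outline if the rows are ever shuffled.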

A Basic Map

A bunch of latitude longitude points…

qplot(long, lat, geom = "point", data = states)

A Bit Better Map

… that are connected with lines in a very specific order.

qplot(long, lat, geom = "path", data = states, group = group) + 
    coord_map()

Polygon vs Path

qplot(long, lat, geom = "polygon", data = states, group = group) + 
    coord_map()

Incorporating Information

  • Add other geographic information by adding geometric layers to the plot
  • Add non-geographic information by altering the fill color for each state
    • Use geom = "polygon" to treat states as solid shapes
    • Show numeric information with color shade/intensity
    • Show categorical information using color hue

Categorical Data

If a categorical variable is assigned as the fill color then qplot will assign different hues for each category.

Load in a state regions dataset:

statereg <- read.csv("https://bit.ly/2i0AFHK")
head(statereg)
##        State StateGroups
## 1 california        West
## 2     nevada        West
## 3     oregon        West
## 4 washington        West
## 5      idaho        West
## 6    montana        West

Joining Data

Join (merge) the original states data with the new information.

The left_join function is used for merging**:

library(dplyr)
states.class.map <- left_join(states, statereg, by = c("region" = "State"))
head(states.class.map)
##        long      lat group order  region subregion StateGroups
## 1 -87.46201 30.38968     1     1 alabama      <NA>       South
## 2 -87.48493 30.37249     1     2 alabama      <NA>       South
## 3 -87.52503 30.37249     1     3 alabama      <NA>       South
## 4 -87.53076 30.33239     1     4 alabama      <NA>       South
## 5 -87.57087 30.32665     1     5 alabama      <NA>       South
## 6 -87.58806 30.32665     1     6 alabama      <NA>       South

** More on this later
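
As a small preview, the sketch below shows what left_join does on made-up data: every row of the left table is kept, columns from the right table are attached where the keys match, and unmatched rows get NA. The data frames and their values are invented for illustration.

```r
library(dplyr)

# Left table: several rows per region, like the map points
map_points <- data.frame(
  region = c("a", "a", "b", "c"),
  order  = 1:4,
  stringsAsFactors = FALSE
)

# Right table: one row per region, with no entry for region "c"
regions <- data.frame(
  State       = c("a", "b"),
  StateGroups = c("East", "West"),
  stringsAsFactors = FALSE
)

# The key columns have different names, so both are named in by
joined <- left_join(map_points, regions, by = c("region" = "State"))
joined
```

The result has the same number of rows as map_points, and the row for region "c" has NA in StateGroups.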

Plotting the Result

qplot(long, lat, geom = "polygon", data = states.class.map, 
      group = group, fill = StateGroups, color = I("black")) + 
    coord_map() 

Numerical Data & Maps

  • Behavioral Risk Factor Surveillance System
  • 2008 telephone survey run by the Centers for Disease Control and Prevention (CDC)
  • Asks a variety of questions related to health and wellness
  • Cleaned data with state aggregated values posted on website

BRFSS Data Aggregated by State

states.stats <- read.csv("https://bit.ly/2gT95Hc")
head(states.stats)
##   state.name   avg.wt avg.qlrest2   avg.ht  avg.bmi avg.drnk
## 1    alabama 180.7247    9.051282 168.0310 29.00222 2.333333
## 2     alaska 189.2756    8.380952 172.0992 28.90572 2.323529
## 3    arizona 169.6867    5.770492 168.2616 27.04900 2.406897
## 4   arkansas 177.3663    8.226619 168.7958 28.02310 2.312500
## 5 california 170.0464    6.847751 168.1314 27.23330 2.170000
## 6   colorado 167.1702    8.134715 169.6110 26.16552 1.970501

Join the data again

states.map <- left_join(states, states.stats, by = c("region" = "state.name"))
head(states.map)
##        long      lat group order  region subregion   avg.wt avg.qlrest2
## 1 -87.46201 30.38968     1     1 alabama      <NA> 180.7247    9.051282
## 2 -87.48493 30.37249     1     2 alabama      <NA> 180.7247    9.051282
## 3 -87.52503 30.37249     1     3 alabama      <NA> 180.7247    9.051282
## 4 -87.53076 30.33239     1     4 alabama      <NA> 180.7247    9.051282
## 5 -87.57087 30.32665     1     5 alabama      <NA> 180.7247    9.051282
## 6 -87.58806 30.32665     1     6 alabama      <NA> 180.7247    9.051282
##    avg.ht  avg.bmi avg.drnk
## 1 168.031 29.00222 2.333333
## 2 168.031 29.00222 2.333333
## 3 168.031 29.00222 2.333333
## 4 168.031 29.00222 2.333333
## 5 168.031 29.00222 2.333333
## 6 168.031 29.00222 2.333333

Shade and Intensity

Average number of days of insufficient sleep in the last 30 days

qplot(long, lat, geom = "polygon", data = states.map, 
      group = group, fill = avg.qlrest2) + coord_map()

BRFSS Data by Gender and State

states.sex.stats <- read.csv("https://srvanderplas.github.io/NPPD-Analytics-Workshop/02.Graphics/data/states.sex.stats.csv")
states.sex.stats <- read.csv("https://bit.ly/2hiKFIb")
head(states.sex.stats)
##   state.name SEX   avg.wt avg.qlrest2   avg.ht  avg.bmi avg.drnk    sex
## 1    alabama   1 198.8936    8.648936 177.5729 28.50714 3.033333   Male
## 2    alabama   2 173.0315    9.224771 163.9956 29.21280 2.041667 Female
## 3     alaska   1 203.3919    7.236111 178.3896 28.91494 2.487179   Male
## 4     alaska   2 169.5660    9.907407 163.1296 28.89286 2.103448 Female
## 5    arizona   1 191.3739    5.163793 177.1724 27.63152 2.814286   Male
## 6    arizona   2 156.2054    6.142857 162.7043 26.67683 2.026667 Female

One More Join

states.sex.map <- left_join(states, states.sex.stats, by = c("region" = "state.name"))
head(states.sex.map)
##        long      lat group order  region subregion SEX   avg.wt
## 1 -87.46201 30.38968     1     1 alabama      <NA>   1 198.8936
## 2 -87.46201 30.38968     1     1 alabama      <NA>   2 173.0315
## 3 -87.48493 30.37249     1     2 alabama      <NA>   1 198.8936
## 4 -87.48493 30.37249     1     2 alabama      <NA>   2 173.0315
## 5 -87.52503 30.37249     1     3 alabama      <NA>   1 198.8936
## 6 -87.52503 30.37249     1     3 alabama      <NA>   2 173.0315
##   avg.qlrest2   avg.ht  avg.bmi avg.drnk    sex
## 1    8.648936 177.5729 28.50714 3.033333   Male
## 2    9.224771 163.9956 29.21280 2.041667 Female
## 3    8.648936 177.5729 28.50714 3.033333   Male
## 4    9.224771 163.9956 29.21280 2.041667 Female
## 5    8.648936 177.5729 28.50714 3.033333   Male
## 6    9.224771 163.9956 29.21280 2.041667 Female
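
Note that every map point now appears twice, once per sex. The sketch below reproduces that effect on made-up data; the data frames are invented for illustration. Recent dplyr versions may warn about a many-to-many relationship in a join like this, which is expected here.

```r
library(dplyr)

# Three boundary points for a single region
map_points <- data.frame(
  region = c("a", "a", "a"),
  order  = 1:3,
  stringsAsFactors = FALSE
)

# Two rows of statistics for that region, one per sex
by_sex <- data.frame(
  state.name = c("a", "a"),
  sex        = c("Male", "Female"),
  avg        = c(3.0, 2.0),
  stringsAsFactors = FALSE
)

doubled <- left_join(map_points, by_sex, by = c("region" = "state.name"))
nrow(map_points)  # rows before the join
nrow(doubled)     # rows after the join: one copy of each point per sex

# Each sex still has the complete, correctly ordered outline
subset(doubled, sex == "Male")$order
```

Plotting such a data frame without separating the sexes would draw two polygons on top of each other, so the copies need to be split, for example by faceting.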

Adding Information

Average # of alcoholic drinks per day by state and gender

qplot(long, lat, geom = "polygon", data = states.sex.map, 
      group = group, fill = avg.drnk) + coord_map() + 
    facet_grid(sex ~ .)

Your Turn

  • Use left_join to combine child healthcare data with maps information.
    You can load in the child healthcare data with:
states.health.stats <- read.csv("https://bit.ly/2hRBMq0")
  • Use qplot to create a map of child healthcare undercoverage rate by state

Your Turn

Solutions

library(maps)
library(dplyr)
states <- map_data("state")
states.health.map <- left_join(states, states.health.stats, 
                               by = c("region" = "state.name"))

# Use qplot to create a map of child healthcare undercoverage 
# rate by state
    
qplot(data = states.health.map, x = long, y = lat, 
      geom = 'polygon', group = group, 
      fill = no.coverage) + coord_map()

Cleaning Up Your Maps

Use ggplot2 options to clean up your map!

  • Adding Titles + ggtitle(...)
  • Might want a plain white background + theme_bw()
  • Extremely familiar geography may eliminate need for latitude and longitude axes + theme(...)
  • Want to customize color gradient + scale_fill_gradient(...) or scale_fill_gradient2(...)
  • Keep aspect ratios correct + coord_map()

Cleaned Up Map

qplot(long, lat, geom = "polygon", data = states.map, 
      group = group, fill = avg.drnk) + 
  coord_map() +  theme_bw() +
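  # Two-color (low/high) gradient; scale_fill_gradient2 would also need a
  # midpoint inside the limits for the low color to be used
  scale_fill_gradient(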
    name = "Avg Drinks",
    limits = c(1.5, 3.5), 
    low = "lightgray", high = "red") + 
  theme(axis.ticks = element_blank(),
        axis.text = element_blank(),
        axis.title = element_blank()) +
  ggtitle("Average Number of Alcoholic Beverages\nConsumed Per Day by State")

Your Turn

Use options to polish the look of your map of child healthcare undercoverage rate by state!

Your Turn

Solutions

qplot(data = states.health.map, x = long, y = lat, 
      geom = 'polygon', group = group, fill = no.coverage) + 
  coord_map() + 
  # Two-color (low/high) gradient from white to red
  scale_fill_gradient(
    name = "Child\nHealthcare\nUndercoverage",
    limits = c(0, .2), 
    low = 'white', high = 'red') + 
  ggtitle("Health Insurance in the U.S.\nWhich states have the highest rates of undercovered children?") +
  theme_minimal() + 
  theme(panel.grid = element_blank(), 
        axis.text = element_blank(),
        axis.title = element_blank())   

Plotting Using Layers

Deepwater Horizon Oil Spill

Datasets

NOAA Data:

  • National Oceanic and Atmospheric Administration
  • Temperature and salinity data in the Gulf of Mexico
  • Measured using floats, gliders and boats

US Fish and Wildlife Service Data:

  • Animal Sightings on the Gulf Coast
  • Birds, Turtles and Mammals
  • Status: Oil Covered or Not

Both data sets have geographic coordinates for every observation

Loading NOAA Data

The NOAA data is stored in an .rdata file, so it is loaded with load() rather than read with read.csv():

  1. Download the data from http://heike.github.io/rwrks/02-r-graphics/data/noaa.rdata
  2. Run the getwd() command to find your current working directory
  3. Place noaa.rdata in the directory from step 2.
  4. Run the command below:
load("noaa.rdata")
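
Unlike read.csv(), load() does not return a data frame to assign; it recreates the saved objects under their original names. The sketch below shows the round trip with two made-up objects and a temporary file; the names a, b and example.rdata are illustrative only.

```r
# Save two objects into one .rdata file in a temporary location
a <- 1:3
b <- data.frame(x = c(0.5, 1.5))
path <- file.path(tempdir(), "example.rdata")
save(a, b, file = path)

# Remove them, then restore them by name with load()
rm(a, b)
restored <- load(path)
restored   # names of the objects that load() created
```

Capturing the return value of load(), as above, is a convenient way to find out which objects a file contains.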

Floats

Take a peek at the top of the floats NOAA data:

head(floats, n = 2)[,1:5]
##   callSign Date_Time JulianDay Time_QC Latitude
## 1 Q4901043 7/12/2010   2455390       1   24.823
## 2 Q4901043 7/12/2010   2455390       1   24.823
head(floats, n = 2)[,6:10]
##   Longitude Position_QC Depth Depth_QC Temperature
## 1   -87.964           1     2        1       29.83
## 2   -87.964           1     4        1       29.65
head(floats, n = 2)[,11:14]
##   Temperature_QC Salinity Salinity_QC  Type
## 1              1    36.59           1 Float
## 2              1    36.58           1 Float

Floats Plot

qplot(Longitude, Latitude, color = callSign, data = floats) + 
    coord_map()

Gliders

qplot(Longitude, Latitude, color = callSign, data = gliders) + 
    coord_map()

Boats

qplot(Longitude, Latitude, color = callSign, data = boats) + 
    coord_map()

Layering

These data sets share the same context: a common time and a common place

  • Want to aggregate information from different sources onto a common plot
  • Start with a common background: the lat/long grid
  • Superimpose data onto the grid in layers using ggplot2

Layers Preview

ggplot() +
    geom_path(data = states, aes(x = long, y = lat, group = group)) + 
    geom_point(data = floats, aes(x = Longitude, y = Latitude, color = callSign)) +   
    geom_point(aes(x, y), shape = "x", size = 5, data = rig) + 
    geom_text(aes(x, y), label = "BP Oil Rig", 
              size = 5, data = rig, hjust = -0.1) + 
    xlim(c(-91, -80)) + ylim(c(22,32)) + coord_map()

More Layering

  • Most maps (and many plots) have multiple layers of data.
  • The layers may be from the same or different datasets.
  • ggplot2 makes it easy to add layers to a plot.

To do this we need to understand a little more about the underlying theory…

What is a Plot?

  • A default dataset
  • A coordinate system
  • Layers of geometric objects (geoms)
  • A set of aesthetic mappings (taking information from the data and converting into an attribute of the plot)
  • A scale for each aesthetic
  • A faceting specification (multiple plots based on subsetting the data)

Floats Decomposed

Data: floats, states

Mappings:

  aesthetic   mapping
  x           Longitude
  y           Latitude
  color       callSign

Scales:

  aesthetic   scale
  x           continuous
  y           continuous
  color       discrete

Geoms: Points (floats), lines (states)

Faceting: None
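
One way to see mappings and scales at work is to look at the data a layer receives after the scales have been applied. The sketch below does this with layer_data() on the built-in mtcars data, since the floats data requires the download above; the names p and built are illustrative.

```r
library(ggplot2)

# Same structure as the floats plot: continuous x and y, discrete color
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Data as the point layer sees it after the scales have been applied
built <- layer_data(p, 1)
head(built[, c("x", "y", "colour")])

# One color per level of the discrete variable
unique(built$colour)
```

The x and y columns hold the original numbers, while the discrete variable has been converted into one color code per level.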

qplot vs ggplot

qplot() stands for “quickplot”:

  • Automatically chooses default settings to make life easier
  • Less control over plot construction

ggplot() stands for “grammar of graphics plot”

  • Constructs the plot using components listed in previous slides

qplot vs ggplot

Two ways to construct the same plot for float locations:

qplot(Longitude, Latitude, color = callSign, data = floats) 

Or:

ggplot(data = floats, 
       aes(x = Longitude, y = Latitude, color = callSign)) +
  geom_point() + 
  scale_x_continuous() + 
  scale_y_continuous() + 
  scale_color_discrete()

Brevity

Even ggplot will automatically pick default scales:

ggplot(data = floats, 
       aes(x = Longitude, y = Latitude, color = callSign)) +
  geom_point()

Your Turn

Find the ggplot() statement that creates this plot:

Hint: look at the Floats data for variable ideas

Your Turn

Solutions

ggplot(aes(x = Depth, y = Temperature, color = callSign), 
       data = floats) + 
  geom_point()

What is a Layer?

A layer added to ggplot() can be a geom…

  • The type of geometric object
  • The statistic mapped to that object
  • The data set from which to obtain the statistic

… or a position adjustment to the scales

  • Changing the axes scale
  • Changing the color gradient

Layer Examples

  Plot                 Geom                Stat
  Scatterplot          point               identity
  Histogram            bar                 bin count
  Smoother             line + ribbon       smoother function
  Binned Scatterplot   rectangle + color   2d bin count

More geoms described at http://docs.ggplot2.org/current/
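
The four rows of the table can be sketched with the diamonds data as follows. The choice of variables, bins = 30 and method = "lm" are assumptions made for this illustration, not settings from the original slides.

```r
library(ggplot2)

# Scatterplot: point geom, identity stat (data drawn as is)
ggplot(diamonds, aes(carat, price)) + geom_point(alpha = 0.2)

# Histogram: bar geom, bin count stat
h <- ggplot(diamonds, aes(price)) + geom_histogram(bins = 30)
h

# Smoother: line + ribbon geom, smoother function stat
ggplot(diamonds, aes(carat, price)) + geom_smooth(method = "lm")

# Binned scatterplot: rectangle + color geom, 2d bin count stat
ggplot(diamonds, aes(carat, price)) + geom_bin2d()

# The histogram layer holds bin counts, not the raw rows
head(layer_data(h)[, c("x", "count")])
```

The last line shows that the stat has already summarized the data before the geom draws it: the layer contains one row per bin, and the counts add up to the number of diamonds.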

Piecing Things Together

Build a map using NOAA data

  • Coordinate system (mapping Long-Lat to X-Y)
  • Add layer of state outlines
  • Add layer of points for float locations
  • Add layers for Oil Rig marker and label
  • Adjust the range of x and y scales

The Result

ggplot() +
    geom_path(data = states, aes(x = long, y = lat, group = group)) + 
    geom_point(data = floats, aes(x = Longitude, y = Latitude, color = callSign)) +   
    geom_point(aes(x, y), shape = "x", size = 5, data = rig) + 
    geom_text(aes(x, y), label = "BP Oil Rig", size = 5, data = rig, hjust = -0.1) + 
    xlim(c(-91, -80)) + 
    ylim(c(22, 32)) + coord_map()

Your Turn

  1. Read in the animal.csv data
    (Data of animal sightings around the Deepwater Site):
animal <- read.csv("https://bit.ly/2hNlTUl")
  2. Plot the location of animal sightings on a map of the region
  3. On this plot, try to color points by class of animal and/or status of animal
  4. Advanced: Could we indicate time somehow?
library(lubridate)
animal$month <- month(as.Date(animal$Date_))
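
The month() call from lubridate can be tried on its own first. The sketch below uses made-up date strings in month/day/year form; the actual format of the Date_ column is not shown in these slides, so the format string here is an assumption.

```r
library(lubridate)

# Made-up date strings in month/day/year form
raw <- c("7/12/2010", "5/3/2010")

# Passing the format explicitly avoids relying on as.Date()'s default guesses
dates <- as.Date(raw, format = "%m/%d/%Y")

month(dates)                 # month as a number
month(dates, label = TRUE)   # month as an ordered factor of abbreviated names
```

A numeric or labelled month like this is a convenient variable for color or for faceting.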

Your Turn

Solutions

  2. Plot the location of animal sightings on a map
ggplot() + 
  geom_path(data = states, aes(x = long, y = lat, group = group)) + 
  geom_point(data = animal, aes(x = Longitude, y = Latitude)) + 
  xlim(c(-91, -80)) + ylim(c(24,32)) + coord_map()

Your Turn

Solutions

  3. On this plot, try to color points by class of animal and/or status of animal
ggplot() + 
  geom_path(data = states, aes(x = long, y = lat, group = group)) + 
  geom_point(data = animal, aes(x = Longitude, y = Latitude,    
                                color = class)) + 
  xlim(c(-91, -80)) + ylim(c(24,32)) + coord_map()

Your Turn

Solutions

  3. On this plot, try to color points by class of animal and/or status of animal
ggplot() + 
  geom_path(data = states, aes(x = long, y = lat, group = group)) + 
  geom_point(data = animal, aes(x = Longitude, y = Latitude,    
                                color = Condition)) + 
  xlim(c(-91, -80)) + ylim(c(24,32)) + coord_map()

Your Turn

Solutions

  4. Advanced: Could we indicate time somehow?
ggplot() + 
  geom_path(data = states, aes(x = long, y = lat, group = group)) + 
  geom_point(data = animal, aes(x = Longitude, y = Latitude,    
                                color = Condition), alpha = .5) +
  xlim(c(-91, -80)) + ylim(c(24,32)) +
  facet_wrap(~month) + coord_map()  
